Digitization and Search, A Non-Traditional Use of HPC

Authors

Liana Diesendruck

Luigi Marini

Rob Kooper

Mayank Kejriwal

Kenton McHenry

Abstract

We describe our efforts to develop an open source cyberinfrastructure that provides a form of automated search of handwritten content within large digitized document archives. Such collections are a treasure trove of data, spanning from decades ago to the present. The information they contain is relevant both to researchers, who might extract numerical or statistical data from such sources, and to the general public. With the push to digitize our paper archives we are, however, faced with the fact that although these digital versions are easier to share, they are not trivially searchable, as the digitization process produces image data and not text. This inability to find or identify contents within these collections makes the data largely unusable without a lengthy and costly manual transcription process. To carry out the search we build on a computer vision technique called word spotting. A form of content-based image retrieval, it avoids the still difficult task of directly recognizing the text: a user searches with a query image containing handwritten text, and the images in the database are ranked by how similar their content looks to the query. To make this search capability available on a large archive, three computationally expensive pre-processing steps are required (Figure 1). First, forms are segmented into individual units of handwritten information. In the case of the 1930 Census data collection, which contains approximately 3.6 million spreadsheet-like forms, this entails breaking the form images into sub-images of individual cells that contain the information about the individuals recorded in the Census. Second, the extracted sub-images are processed to extract features and descriptors that represent the handwritten contents within them.
The word spotting method we use produces a 30-dimensional vector derived from the frequency components of the darker ink pixels [1]. The distance between two such signature vectors can be used to determine how similar the handwritten contents of their cell sub-images are. Third, an indexing step organizes these extracted signatures into a binary tree structure to enable fast user queries. For the 1930 Census data this involves organizing nearly 7 billion sub-images using hierarchical agglomerative clustering. Organizing the entire collection at once is not practical, so we instead break this step into multiple index construction steps based on states, categories, and microfilm reels passing the …
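The ranking and indexing ideas in the abstract can be illustrated with a small sketch. It is not the authors' implementation: the Euclidean metric, the centroid-linkage merge rule, the greedy tree descent, and every name (`distance`, `rank`, `Node`, `build_index`, `query`) are assumptions made for illustration, and short toy vectors stand in for the 30-dimensional signatures.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Optional, Sequence

Vector = Sequence[float]


def distance(a: Vector, b: Vector) -> float:
    """Euclidean distance between two signature vectors (assumed metric)."""
    return math.dist(a, b)


def rank(signatures: Dict[str, Vector], q: Vector, k: int = 5) -> List[str]:
    """Brute-force search: ids of the k cells whose signatures are nearest to q."""
    return sorted(signatures, key=lambda cid: distance(signatures[cid], q))[:k]


@dataclass
class Node:
    """Node of the binary index tree; only leaves carry a cell id."""
    centroid: List[float]
    size: int
    cell_id: Optional[str] = None
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def merge(a: Node, b: Node) -> Node:
    """Join two clusters under a parent whose centroid is their weighted mean."""
    n = a.size + b.size
    centroid = [(x * a.size + y * b.size) / n for x, y in zip(a.centroid, b.centroid)]
    return Node(centroid, n, left=a, right=b)


def build_index(signatures: Dict[str, Vector]) -> Node:
    """Naive agglomerative clustering (centroid linkage), cubic time, toy sizes only."""
    if not signatures:
        raise ValueError("need at least one signature")
    clusters = [Node(list(v), 1, cell_id=k) for k, v in signatures.items()]
    while len(clusters) > 1:
        # Pick the pair of clusters whose centroids are closest.
        i, j = min(
            ((i, j) for i in range(len(clusters)) for j in range(i + 1, len(clusters))),
            key=lambda p: distance(clusters[p[0]].centroid, clusters[p[1]].centroid),
        )
        merged = merge(clusters[i], clusters[j])
        clusters = [c for k, c in enumerate(clusters) if k not in (i, j)] + [merged]
    return clusters[0]


def query(root: Node, q: Vector) -> str:
    """Greedy descent: at each level follow the child with the nearer centroid."""
    node = root
    while node.cell_id is None:
        node = min((node.left, node.right), key=lambda c: distance(q, c.centroid))
    return node.cell_id


if __name__ == "__main__":
    toy = {"a": [0.0, 0.0], "b": [0.1, 0.0], "c": [5.0, 5.0], "d": [5.1, 5.0]}
    print(rank(toy, [4.9, 5.2], k=2))
    print(query(build_index(toy), [4.9, 5.2]))
```

The brute-force `rank` compares the query against every signature, while `query` inspects only one root-to-leaf path. That is the motivation for building a tree index, at the cost of an approximate answer and an expensive construction step.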

Similar resources

From Traditional to Digital Environment: An Analysis of the Evolution of Business Models and New Marketing Strategies

This paper analyzes the major trends in the business environment that shaped the business models adopted by companies and their new marketing strategies. It adopts a desktop research methodology by collecting data from previous academic papers, statistical, and analytical reports. It starts by analyzing the globalization trend that forced most of the emerging economies to liberalize and privati...

Digitization and Path Disruption: An Examination in the Funeral Industry

While the digitization of the business landscape provides firms with numerous business opportunities, it has severely disrupted established business practices of many traditional “offline businesses.” To shed light on the disruptive nature of digitization and the challenges that it entails for traditional offline businesses, we draw on path dependence theory to examine how digitization disrupts...

Education, the Key to Success in Non-Pharmacological Interventions in the Control and Treatment of Type 2 Diabetes: A Systematic Review

Background: The prevalence of type 2 diabetes is a global health challenge that requires continuous care. The use of non-pharmaceutical interventions in the control and treatment of type 2 diabetes can be less costly and cause fewer complications. Therefore, this study aimed to identify a variety of non-pharmaceutical interventions in the control and treatment of type 2 diabetes through systematic r...

Heterotrophic Bacteria in the Drinking Water of Tabriz City

Background and Aim: Recently the use of heterotrophic plate count (HPC) has received much attention as a supplementary indicator of the MPN test in water quality control. The US Environmental Protection Agency (USEPA) has declared 500 cfu/mL as the maximum acceptable level for heterotrophic bacteria in distribution networks. Currently the HPC determination is not among the routine control items...

The perceptibility curve test applied to CCD and two methods of digitization of dental film-based radiographs

Objectives: Several methods of image acquisition are available in dentistry. There is no generally accepted method of image digitization that makes all types of images comparable. The objective of this study was to compare the diagnostic accuracy of different methods of image digitization. Methods: This diagnostic accuracy study used the perceptibility curve test, which first intr...

Iterated Local Search Algorithm for the Constrained Two-Dimensional Non-Guillotine Cutting Problem

An Iterated Local Search method for the constrained two-dimensional non-guillotine cutting problem is presented. This problem consists in cutting pieces from a large stock rectangle to maximize the total value of pieces cut. In this problem, we take into account restrictions on the number of pieces of each size required to be cut. It can be classified as 2D-SLOPP (two dimensional single large o...

Publication date: 2012
